For this exercise we're going to learn how to use a probability distribution built from data to generate new data. In other words, we're going to generate plausible fake text from a body of text. To help out, we're going to need Python's NLTK and NumPy libraries.


In [115]:
import nltk, random, numpy, textwrap

In [3]:
nltk.corpus.gutenberg.fileids()


Out[3]:
[u'austen-emma.txt',
 u'austen-persuasion.txt',
 u'austen-sense.txt',
 u'bible-kjv.txt',
 u'blake-poems.txt',
 u'bryant-stories.txt',
 u'burgess-busterbrown.txt',
 u'carroll-alice.txt',
 u'chesterton-ball.txt',
 u'chesterton-brown.txt',
 u'chesterton-thursday.txt',
 u'edgeworth-parents.txt',
 u'melville-moby_dick.txt',
 u'milton-paradise.txt',
 u'shakespeare-caesar.txt',
 u'shakespeare-hamlet.txt',
 u'shakespeare-macbeth.txt',
 u'whitman-leaves.txt']

NLTK has a neat corpus module with sets of test data to use. Among the Project Gutenberg files we've got a nice selection of classics to choose from. We'll choose "Moby Dick" because it is particularly long. Our first task is to read the data into a Python string and split it into words so we can process it.


In [7]:
training = nltk.corpus.gutenberg.open('melville-moby_dick.txt').read()

In [8]:
training_words = training.split()

Now we will take a closer look at the data.


In [9]:
training_words[0]


Out[9]:
u'[Moby'

Ah, well that's a problem. The first words of this file look like a header. So let's move a little way down the list until we find a sentence start after the header. With a large enough body of text this won't matter, but for our purposes let's make it easier on ourselves by making sure our data is higher quality. High-quality data is the first step towards better predictions, which is why in practice data scientists spend much of their time on it.


In [15]:
for idx, (w1, w2) in enumerate(zip(training_words, training_words[1:])):
    if w1 == "Call" and w2 == "me":
        print "Ishmael", idx
        break


Ishmael 3608

Wow! That's pretty far down the page. Let's verify before moving forward.


In [16]:
print(training_words[3550:3650])


[u'--WHARTON', u'THE', u'WHALE', u'KILLER.', u'"So', u'be', u'cheery,', u'my', u'lads,', u'let', u'your', u'hearts', u'never', u'fail,', u'While', u'the', u'bold', u'harpooneer', u'is', u'striking', u'the', u'whale!"', u'--NANTUCKET', u'SONG.', u'"Oh,', u'the', u'rare', u'old', u'Whale,', u'mid', u'storm', u'and', u'gale', u'In', u'his', u'ocean', u'home', u'will', u'be', u'A', u'giant', u'in', u'might,', u'where', u'might', u'is', u'right,', u'And', u'King', u'of', u'the', u'boundless', u'sea."', u'--WHALE', u'SONG.', u'CHAPTER', u'1', u'Loomings.', u'Call', u'me', u'Ishmael.', u'Some', u'years', u'ago--never', u'mind', u'how', u'long', u'precisely--having', u'little', u'or', u'no', u'money', u'in', u'my', u'purse,', u'and', u'nothing', u'particular', u'to', u'interest', u'me', u'on', u'shore,', u'I', u'thought', u'I', u'would', u'sail', u'about', u'a', u'little', u'and', u'see', u'the', u'watery', u'part', u'of', u'the', u'world.', u'It']

Okay, looks good. Now let's zoom in.


In [17]:
print(training_words[3608:3612])


[u'Call', u'me', u'Ishmael.', u'Some']

Alright, 3608 is definitely our starting point. And now we can see another issue: if we're going to generate plausible text, we need to take sentences into account, not just words.

Markov Chains

Markov chains let us model the probability that one word will be followed by another, mapping out a directed graph of state transitions between the words in our body of text. In this way we capture something about grammar without having to encode it explicitly in our model. For more information, take a minute to read here: https://en.wikipedia.org/wiki/Markov_chain
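To make that concrete, here's a toy example (not part of the notebook's code) of the transition table we're about to build, using the made-up sentence "the cat saw the dog." Each word maps to a dictionary counting the words that follow it, with "_START" standing in for the beginning of a sentence:

# Transition counts for the toy sentence "the cat saw the dog."
# "_START" marks a sentence start; each inner value counts how often
# that word follows the outer key.
toy_model = {
    "_START": {"the": 1},
    "the": {"cat": 1, "dog.": 1},
    "cat": {"saw": 1},
    "saw": {"the": 1},
}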


In [129]:
def build_model(words):
    last_word = "_START"  # Initialize from the sentence-start state
    model = {"_START": {}}
    for word in words:
        # Count that `word` followed `last_word`, creating the inner
        # counter dict the first time we see a given last_word.
        try:
            prev = model[last_word]
        except KeyError:
            model[last_word] = {word: 1}
        else:
            if word in prev:
                prev[word] += 1
            else:
                prev[word] = 1

        # A word that ends a sentence transitions back to the start state.
        if word.endswith('.') or word.endswith('!'):
            last_word = "_START"
        else:
            last_word = word
    return model

Okay, now that we've created a function to build our model, let's try it out and see what we get.


In [130]:
mk = build_model(training_words[3608:])
print(len(mk))


28558

It looks like we have a lot of top-level keys. This means there are 28558 states for which we've recorded transition frequencies. Let's check out our top starting words.


In [132]:
sorted(mk["_START"].items(), key=lambda (k,v): -v)[:10]


Out[132]:
[(u'But', 582),
 (u'The', 395),
 (u'I', 278),
 (u'And', 265),
 (u'It', 231),
 (u'In', 190),
 (u'He', 168),
 (u'For', 154),
 (u'CHAPTER', 105),
 (u'A', 97)]

If they're all capitalized like this, we're doing well. Now let's see if we can do something a little more complicated: let's find the top-ranked pairs of words using a nested generator expression inside sorted().


In [48]:
pairs = sorted((-freq, (w1,w2)) for w1, follows in mk.items() for w2, freq in follows.items() if w1 != "_START")

In [49]:
for freq, (w1, w2) in pairs[:50]:
    print "{}: {} {}".format(-freq, w1, w2)


1821: of the
1088: in the
698: to the
421: from the
353: of his
350: and the
318: on the
316: of a
312: at the
306: with the
304: to be
294: by the
282: for the
246: in his
242: into the
235: in a
228: with a
216: upon the
211: that the
196: as the
193: all the
166: out of
165: it is
162: it was
157: for a
155: the whale
153: like a
147: the same
142: one of
139: to his
136: is the
133: over the
126: sort of
123: was a
122: as a
122: as if
121: of all
121: such a
116: in this
116: of this
113: of that
110: had been
110: have been
109: is a
109: to a
108: and then
107: with his
106: the most
104: I have
104: the other

Now we know the top pairs, but we have to make weighted random transitions, not just list the top pairs. With a little help from numpy we can take a cumulative sum of the frequencies, transforming [('The', 1), ('if', 7), ('as', 3)] into [('The', 1), ('if', 8), ('as', 11)]; drawing a random integer up to the total then gives us a weighted random choice for the next word given a particular prior word.
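As a quick sanity check of that cumulative-sum idea, here's a minimal sketch using the made-up counts above (the variable names are purely illustrative; the random_choice function below does the same thing with real transition counts):

# Made-up (word, frequency) pairs from the example in the text.
freqs = [('The', 1), ('if', 7), ('as', 3)]
# Running total of the frequencies: array([ 1,  8, 11]).
cum = numpy.cumsum([f for w, f in freqs])
# Pair each word with its cumulative total.
print(list(zip([w for w, f in freqs], cum)))  # [('The', 1), ('if', 8), ('as', 11)]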


In [99]:
def random_choice(choices):
    # choices maps each candidate word to its frequency. Build a list of
    # (cumulative_frequency, word) pairs so that a random integer drawn
    # from [1, total] lands on each word in proportion to its frequency.
    ch_list = sorted(choices.items(), key=lambda (k, v): v)  # [('w1', 1), ('w2', 3), ... ]
    cs = list(numpy.cumsum([v for k, v in ch_list]))
    cum_choices = [(cf, w) for cf, (w, f) in zip(cs, ch_list)]
    choice = random.randint(1, cum_choices[-1][0])
    for cf, value in cum_choices:
        if choice <= cf:
            return value
    raise ValueError("Random range out of bounds")

In [108]:
x = random_choice(mk["_START"])

In [109]:
print (x, mk['_START'][x])


(u'The', 395)

Now that we've got a function to make random weighted choices, all that's left is to create our generation function. In the end, this is a simple algorithm:

  1. Have we generated enough words, and does the last word end a sentence? If not, proceed to 2. If so, finish.
  2. Is there no previous word, or does the previous word end a sentence? If so, proceed to 3; otherwise proceed to 4.
  3. Make a weighted random choice from the sentence-starting words, add it to the end of the word list, and return to 1.
  4. Make a weighted random choice for the next word based on the current word, add it to the end of the word list, and return to 1.

In [119]:
def generate(min_words, weights):
    words = []
    # Keep generating until we have at least min_words words AND the
    # last word ends a sentence.
    while len(words) < min_words or not words[-1].endswith("."):

        # Start a new sentence if there is no previous word or the
        # previous word ended one; otherwise transition from it.
        try:
            prev = "_START" if words[-1].endswith(".") or words[-1].endswith("!") else words[-1]
        except IndexError:
            prev = "_START"

        words.append(random_choice(weights[prev]))
    return " ".join(words)

In [122]:
print (generate(30, mk))


But, by the creature as Ahab, as a wooden stock fish. But I seem to it, thou canst consume; but a word ROSE, and to his future ages, I was not been slily hovering over golden finger darted that mast, an inferior fellows all manner in my tambourine in the last come to him; but he did not.

Nice! Let's generate a longer one now. A single long line would wrap wherever the terminal runs out of room, possibly in the middle of a word, so let's use Python's textwrap module to wrap the output at word boundaries and print a whole page.


In [123]:
print( "\n".join(textwrap.wrap(generate(500, mk), 80)))


(RECLINING AND ALL SORTS LYING ABOUT THE STERN WINDOWS; AHAB STANDING BEFORE HIS
VICE-BENCH, AND BY HIS CAP.) It's worse than their leader originally educated
according to some cases such fervent rays, that like a quaint craft steeply
leaning against mine, when I saw him in it takes away then; pile themselves over
it is ballasted with a perturbation. And now, I pass. Immediately the Jeroboam's
boat where this thing he were lost their dreams of any path in the Bachelor's
men sailors mark this ship--widows and that broke out to the carpenter here (by
the Nantucketer, out of unknown to its far had voluntarily shipped for it to
Nantucket captains of full of Birmah, forms and beheaded; and islets of it; far
as little Johnny in the ear has de damndest row as he takes it be discovered.
(RECLINING ON DECK; PIP CATCHES HIM BY THE WINDOW.) 'Twas rehearsed by and not
rudely down, a panic; and tow dismasted frigates in vain remonstrating against
the dark, purplish, yellow sea.* *That part of ships in keeping that all seasons
a dismasted man at the old established family likeness! the cause, and
wrinkles." "It is an intelligent spirit that never mind how hard to another, and
that you really beautiful and solitary twain; openly, and considering the van,
this wide waters. For again for mere fright; and predestinated day was on the
anvil--the red flag come down!" When Stubb now calmly rose, and frankly thrust
back upon thousands of the summer are wonders of the storm for the person I
myself--that is the water-bearer, Flask; so much as the hump. What's that goney
was gliding strangeness began to sacrifice of their bloated livers and thorough
whaleman, you please, furnish to be facetious than billiard-tables, and spoke
it; this whale." "It was the White Whale, after breaking are not used in the way
of their boats pursued, he must mount to be forced to the skeleton would then
streamed behind him. Pray, what tune is glued in itself, and to understand, that
Japanese seas have early oozed along it, as possible, watching the shore the
crotch, and it is mystically carved in his wild ocean rolled. Blood and sent the
oldest mariner who slights it. From the bottom of the whale, the heavens and
thrown up the harpooneers of sight; an ill effect, of the whale, the harpooneer
hears he is an air-freighted demijohn. I told him so. So that sight, completely
digest even assuming the islands, battled with it; and portents and incidentally
suggested to hand, as that club-hammer there on the revolving outer end to this
earth. After many of superstitious probability. Do ye unless they are always
among them, with great a question to him, wounding and whale-ports; this book,
you say dat whale. You would find their unctuousness is whittled down in this;
only clear Truth is his." "We have for a sagacious savage, he will sometimes
happens to quote:-- SACRED TO THE MAINMAST; STARBUCK APPROACHING HIM. It would
come twenty feet long will derive the investigator can hold, where an interlude
and all their own scythes--though in shades you burn, but on the circumstances,
its rack, within his slippery back, and unexaggerating historian, except in his
noon the benefit of gun-metal, stands there, a very much weary for the deck, and
to be otherwise unaccountable odds the Fejee god made the desperate scene.

We've done it! With no understanding of grammar or meaning, we've learned how to generate semi-plausible text. With some more help from NLTK, we could improve this further by using stemming and punctuation detection to build a richer probability distribution, but for now we'll stop here.
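As a pointer for that follow-up, here's a minimal sketch (not run here) of how NLTK's tokenizers could replace the bare str.split(): sent_tokenize and word_tokenize split the raw text into sentences and separate punctuation into its own tokens, which would give the model cleaner states to work with. Both require NLTK's 'punkt' tokenizer data to be downloaded first.

# Sketch only: tokenize with NLTK instead of str.split().
# Requires the punkt data: nltk.download('punkt')
sentences = nltk.sent_tokenize(training)                 # split the raw text into sentences
tokenized = [nltk.word_tokenize(s) for s in sentences]   # punctuation becomes its own token
print(tokenized[0][:10])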